8M5, Canada.

The influenza A virus RNA polymerase cleaves the 5' end of host pre-mRNAs and uses the capped RNA 
fragments as primers for viral mRNA synthesis. We performed deep sequencing of the 5' ends of viral 
mRNAs from all genome segments transcribed in both human (A549) and mouse (M-1) cells infected with 
the influenza A/HongKong/1/1968 (H3N2) virus. In addition to information on RNA motifs present, our 
results indicate that the host primers are divergent between the viral transcripts. We observed differences in 
length distributions, nucleotide motifs and the identity of the host primers between the viral mRNAs. 
Mapping the reads to known transcription start sites indicates that the virus targets the most abundant host 
mRNAs, which is likely caused by the higher expression of these genes. Our findings suggest negligible 
competition amongst RdRp:vRNA complexes for individual host mRNA templates during cap-snatching
and provide a better understanding of the molecular mechanism governing the first step of transcription of 
this influenza strain. 



Influenza viruses belong to the family Orthomyxoviridae, consisting of enveloped viruses that contain
single-stranded negative-sense segmented RNA genomes. Among the influenza virus genera (A, B and C), influenza
A virus (IAV) is the most virulent, and causes significant worldwide mortality and morbidity. The genome of
IAV consists of eight segments, which encode at least 11 known proteins (reviewed elsewhere). Upon internalization, the
viral genome segments are released into the cytoplasm as vRNPs, which are transported into the nucleus using the
importin α/importin β pathway. At an early step of the IAV replication cycle, the viral RNA-dependent RNA
polymerase (RdRp) produces viral mRNAs, which have features of cellular mRNAs, including a 5' cap structure
and a poly(A) tail. During the replication phase, the viral RdRp produces complementary RNAs (cRNAs) from
the vRNAs, which are then used as templates to produce progeny vRNAs. These vRNAs serve as templates for the
synthesis of more viral mRNAs or are exported to become incorporated into new virus particles. Notably, the viral 
RdRp is required for both of these steps. 

One of the most intriguing aspects of IAV transcription is that synthesis of viral mRNAs is dependent on
capped RNA primers derived from host pre-mRNAs. During this process, known as "cap-snatching", the viral
RdRp interacts with the C-terminal domain of host RNA polymerase II (RNAP II) to gain access to the 5' cap
structure of nascent mRNAs. Following RdRp endonucleolytic activity, the capped RNA fragments prime viral
mRNA transcription. Because the excised capped RNA fragments also contain 10-15 nucleotides downstream
of the cap, the priming of viral transcription yields products that are genetic hybrids of host and virus
mRNAs; as a result, sequence heterogeneity is found at the host-derived 5' ends of viral mRNAs. As cleavage of
caps by IAV also involves the destruction of the pre-mRNAs, which potentially induces RNAP II degradation, the
antagonistic properties of RdRp on host RNAP II transcription have been implicated in host gene shut-off
(reviewed elsewhere).

The IAV RdRp is a heterotrimeric complex of three subunits: polymerase basic protein 1 (PB1), polymerase
basic protein 2 (PB2), and polymerase acidic protein (PA). PB1 contains conserved motifs typical of RdRps, and
possesses the polymerization activity. The PB2 and PA subunits are involved in the initiation of transcription,
by binding to and cleaving capped host pre-mRNAs, respectively. In addition to these three RdRp core
proteins, the nucleoprotein (NP) interacts with the PB1 and PB2 subunits, binds RNA, and is required for
transcription of viral genes. Finally, the RdRp also requires a vRNA template to carry out cap-snatching of
host pre-mRNAs. The vRNAs contain 13 nucleotides at the 5' end and 12 nucleotides at the 3' end, which are
highly conserved between all eight segments across all influenza A strains. These nucleotides form a
partially base-paired vRNA promoter, which has been proposed to interact with IAV RdRp subunits.

Figure 1 | Library of IAV transcripts prepared for deep sequencing by Illumina HiSeq. Human (A549) and mouse (M-1) cells were infected with the
A/HongKong/1/1968 (H3N2) strain of IAV for four hours. The mRNA was extracted, an RNA oligonucleotide (yellow box) was ligated to the 5' end, and the
mRNAs were reverse-transcribed using a poly-dT primer. PCR amplification was performed on each IAV cDNA using the 5' RACE primer and a
gene-specific primer that annealed 20-40 nucleotides downstream of the ATG codon (highlighted in red). The region corresponding to the heterogeneous
host-derived 5' end of viral mRNA is shown in black. Four-nucleotide identifier tags (blue boxes) were added to each library for multiplex sequencing,
and adapter sequences for sequencing (red boxes) were added onto the ends of the DNA.

PA-mediated cleavage occurs at a phosphodiester bond 10-15 nt
downstream of the cap structure of host pre-mRNAs. The N-terminal
region of PA adopts a fold similar to resolvases and type
II restriction endonucleases, and possesses the endonuclease activity.
In vitro studies have suggested that the N-terminal region of
PA preferentially cleaves RNA at the phosphodiester bond 3' of a
guanine (G) residue, although it is unknown if the same preference
occurs in vivo with the complete RdRp complex. The IAV RdRp can
also use the dinucleotide AG as primer to initiate complementary
RNA synthesis in vitro. Other studies have proposed that the viral
endonuclease cleaves host mRNA after a purine residue; IAV RdRp
was found to preferentially use CA-terminated capped fragments to
initiate complementary RNA synthesis and add a G residue directed
by the complementary C residue located at the 3' end of the vRNA
template, which has the sequence 3'-UCG. Initiation by the
addition of a C directed by the G at position 3 in the vRNA template
has also been observed. To substantiate these findings and to
clarify the IAV endonuclease sequence specificity, a more extensive
study of the host-derived primers used during viral transcription in
infected cells is needed.

In this study, we investigated the characteristics of the host- 
derived sequences located at the 5' end of the mRNAs of the clinical 
human isolate A/HongKong/1/1968 (H3N2) to determine whether 
RNA selection occurs during cap-snatching of host pre-mRNAs by 
the viral RdRp. To this end, we performed high-throughput sequen- 
cing of the host-derived sequences located at the 5' ends of the eight 
IAV mRNAs early after infection of both human lung epithelial
(A549) and mouse kidney epithelial (M-1) cells. We investigated 
the nature of nucleotide motifs enriched in the host primers, and 
observed noticeable differences in both the length distributions and 
the identity of the host primers used by the eight viral mRNAs. 
Despite these differences, our analysis suggests that most of the host 
primers originate from highly abundant host mRNAs. Overall, our 
results suggest a new layer of complexity within the A/HongKong/1/ 
1968 (H3N2) cap-snatching mechanism, wherein cellular RNPs 
could be used to recruit the RdRp:vRNA complexes to specific sets 
of genes/pre-mRNAs. 

Results 

High-throughput sequencing of the 5' UTR of influenza
A/HongKong/1/1968 (H3N2) virus mRNA and extraction of the
heterogeneous sequences. We performed high-throughput sequencing
of the host-derived 5' ends located on the eight mRNAs of
A/HongKong/1/1968 (H3N2) following infection of both human lung
epithelial (A549) and mouse kidney epithelial (M-1) cells. To obtain
information about host pre-mRNAs used mainly at the earliest step
of the viral replication cycle (i.e. during early viral mRNA synthesis),
both cell lines were infected with A/HongKong/1/1968 (H3N2) at high
MOI, and polyadenylated mRNA was extracted four hours after
infection. We then selectively ligated an RNA oligonucleotide to
the 5' ends of capped mRNAs, reverse-transcribed the mRNAs
using a poly-dT primer, and amplified each IAV cDNA by PCR
using the 5' RACE primer and gene-specific primers located just
downstream of the translation initiation codons. A subsequent
round of PCR amplification was performed with primers containing
Illumina adapter sequences and barcodes for multiplexing the
samples. To avoid PCR over-amplification, the number of PCR
cycles was kept to the minimum required to observe a band by
agarose gel electrophoresis. The PCR products were gel-extracted
and submitted for Sanger sequencing, which allowed us to confirm
the identities of the sequences (Fig. 1). All the cDNAs were
subsequently mixed and sent for high-throughput sequencing using
the Illumina HiSeq 2000 System, which produced 154,826,647 reads
of 100 nucleotides (nts) in length from the library.

The reads were divided into their respective host and viral groups 
based on their barcodes and the sequences of the non-coding regions 
at the 5' ends of each viral mRNA. Approximately 52.5% of the reads 
obtained did not include sequences corresponding to any of the eight 
IAV non-coding regions. Because the sequences on IAV mRNA used
for amplification have a low GC content (45-50%), a relatively low
primer annealing temperature had to be used during PCR amplification,
which might have resulted in co-amplification of unrelated host
mRNA. Overall, we obtained 28.7 and 44.8 million reads corresponding
to IAV mRNA from human and mouse cells, respectively.
The inset of Figure 2 shows the proportion of reads for each of the
different IAV mRNAs obtained from human cells (similar data
derived from mouse cells are presented in Supplementary Fig. 1). 
Unequal amounts of reads were obtained for each transcript, which 
likely reflects the differences in the amount of DNA mixed before 
deep sequencing. 

Analysis of the sequences between the ligated RACE primer and
position G2 on IAV mRNA (herein referred to as "heterogeneous
sequences" and indicated by the white characters on a black back- 
ground in Fig. 1) indicated that 91.8% of the heterogeneous 
sequences had lengths ranging from 10 to 15 nt, with main peaks 
at 11-12 nts, and that the length distribution was similar in the 
samples derived from both hosts (Fig. 2 and Supplementary Fig. 
1). A small population of sequences was shorter than 10 nt (about 
6.7%), and likely represents the result of amplification of degraded 
RNA. For this reason, they were removed from subsequent analyses. 
Interestingly, the length distributions varied between the different 
transcripts. We observed main peaks at 11 nt for the mRNAs coding
for PB1 and PB2, and at 12 nt for those coding for NA, NP, PA, NS
(Fig. 2). Similar variations were also observed in IAV mRNAs isolated
from mouse cells (Supplementary Fig. 1), suggesting differences
in viral RdRp-mediated cleavage and/or host mRNAs used during
cap-snatching of this influenza strain.

Figure 2 | Length distribution of the heterogeneous sequences obtained from infected A549 cells. All sequences representing each of the eight IAV
transcripts isolated from infected A549 cells and located between the ligated RACE primer and position G2 on IAV mRNA are included. Inset: proportion
of IAV sequences corresponding to each transcript obtained following high-throughput sequencing.
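
The extraction of the heterogeneous sequences and their length distribution, as described above, can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. The function names and the adapter argument are hypothetical, and locating G2 by searching for the conserved viral 5' sequence GC[A/G]AAAGCAGG is an assumption of this sketch (the text only states that the virus-encoded sequence begins with 5' GC[G/A]AAA).

```python
import re
from collections import Counter

# Assumption: G2 is located by the conserved virus-encoded sequence
# 5'-GC[A/G]AAAGCAGG-3' that follows the host-derived leader.
VIRAL_START = re.compile(r"GC[AG]AAAGCAGG")

def extract_leader(read, adapter):
    """Return the sequence between the ligated adapter and viral G2, or None."""
    if not read.startswith(adapter):
        return None
    body = read[len(adapter):]
    match = VIRAL_START.search(body)
    if match is None:
        return None
    return body[:match.start()]

def length_distribution(leaders, min_len=10):
    """Fraction of leaders at each length, ignoring leaders shorter than min_len."""
    kept = [s for s in leaders if len(s) >= min_len]
    if not kept:
        return {}
    counts = Counter(len(s) for s in kept)
    return {length: n / len(kept) for length, n in sorted(counts.items())}
```

In this sketch, leaders shorter than 10 nt are dropped before the distribution is computed, mirroring the filtering step described in the text; running it separately on the reads of each transcript would give per-transcript distributions of the kind shown in Figure 2.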



Analysis of the heterogeneous sequences located at the 5' end of 
the eight viral mRNAs. To determine whether the heterogeneous 
sequences contain specific RNA motifs, we calculated the nucleotide 
frequencies observed from 15 nt upstream of the G2 and up to 5 nt 
downstream of G2 for all IAV mRNAs (Fig. 3 and Supplementary
Table 1). Sequences that appeared only once (1.4%) were removed 
from this and all subsequent analyses, to avoid the contribution of 
mutations that might have been generated by the protocols used. 
While the virus-encoded sequence was conserved (5' GC[G/A]AAA),
the heterogeneous sequences showed a high degree of
divergence in sequence, consistent with what we observed by
Sanger sequencing of the amplicons (Fig. 1). Despite this variability, we
observed a nonrandom distribution of the nucleotide frequencies
within the heterogeneous sequences. The nucleotide immediately
upstream of G2 was a purine in the majority of the sequences
(67.5 ± 2.0%). While an A was the enriched residue in most of the
transcripts, PB2 mRNA exhibited a preference for G. We also
observed a preference for C (63.7 ± 6.7%) and G (56.9 ± 5.7%)
residues at positions -2 and -3 upstream of G2, respectively, in all IAV
mRNAs, with the exception of those coding for PB1 and PB2. By
analyzing these three positions together, we observed a preference for
a GCA trinucleotide at the 3' end of the heterogeneous sequence in
those mRNAs (Supplementary Fig. 2). For PB2 mRNAs, the G
residue located immediately upstream of G2 was preceded
predominantly by a GCA motif, causing a shift in the position of
the GCA trinucleotide as compared to the other IAV mRNAs, and
producing a heterogeneous sequence for this IAV mRNA terminating
with a GCAG tetranucleotide as the most abundant motif.

Further upstream of this motif, the heterogeneous sequences
showed a bias towards G/C nucleotides for all IAV mRNA leaders
(63.8 ± 9.4%). Specifically, the fragments used to prime most IAV
mRNAs showed a preference for G-rich primers, while those used to
prime PB1 mRNA were C/U-enriched (Fig. 3 and Supplementary
Table 1). Sequence motifs were comparable between human- and
mouse-derived samples for each transcript, suggesting a common
bias towards similar sequences in each species (Supplementary Table
2 and Fig. 3-4). Finally, an enrichment in adenines just upstream of
the G/C-rich region was observed. This correlates with the 5' ends of
most of the heterogeneous sequences (ranging from 9 to 15 nt in
length), as represented by the grey bars showing the percentage of the
population of reads included in the calculation (Fig. 3 and
Supplementary Fig. 3). This enrichment likely reflects the preference
for this nucleotide during transcription initiation by RNAP II, and
also provides evidence that our procedure to ligate an RNA oligonucleotide
selectively to the 5' end of 5'-capped mRNAs was effective.
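
The position-wise nucleotide frequencies discussed above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the leaders are assumed to be DNA strings (A, C, G, T) already extracted from the reads, and positions are counted backwards from the host/virus junction so that -1 is the nucleotide immediately upstream of G2.

```python
from collections import Counter

def drop_singletons(leaders):
    """Keep only sequences observed at least twice, as done before the analysis."""
    counts = Counter(leaders)
    return [s for s in leaders if counts[s] > 1]

def position_frequencies(leaders, max_upstream=15):
    """Nucleotide frequencies at positions -1 .. -max_upstream from the junction.

    Sequences are aligned at their 3' end, so shorter leaders simply
    contribute to fewer positions.
    """
    freqs = {}
    for pos in range(1, max_upstream + 1):
        column = Counter(s[-pos] for s in leaders if len(s) >= pos)
        total = sum(column.values())
        if total:
            freqs[-pos] = {nt: column[nt] / total for nt in "ACGT"}
    return freqs
```

Because the leaders are aligned at the junction rather than at the cap, the number of sequences contributing to each column shrinks with distance from G2, which is the quantity the grey bars in Figure 3 are described as reporting.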

Figure 3 | Sequence variation present in the heterogeneous sequences located at the 5' end of viral mRNAs obtained from infected A549 cells. For each
transcript, the top panel shows the relative nucleotide frequency at each position (log scaled), and the bottom panel shows the Logo representation of the
nucleotide variation. The nucleotides are numbered according to the host/virus junction (blue line), defined as the phosphodiester bond located
upstream of position G2 on IAV mRNA. The grey bars represent the percentage of the population of reads included in the calculation. Only sequence
reads that appeared at least twice were used in the analysis. For all IAV mRNAs, only nucleotide frequencies observed from 15 nt upstream of G2 and
up to 5 nt downstream of G2 are represented. All motifs are represented in the 5' to 3' orientation.

When we compared the heterogeneous sequences directly, our
results indicated a minimal overlap in their identities among the
different viral mRNAs (Fig. 4a and Supplementary Fig. 5a). Using
pair-wise comparisons of the populations of sequences with tests,
our analysis revealed significant differences among the heterogeneous
sequences located at the 5' end of the eight IAV mRNAs, with all
p-values smaller than 10"'^. One possibility to explain these divergences
is that our sequencing sampling depth might have been insufficient.
To estimate the complexity of the sequencing libraries, we
calculated rarefaction curves for each mRNA group. Supplementary
Fig. 6 and 7 show rarefaction curves representing the number of
unique heterogeneous sequence variants as a function of the number
of reads obtained for both human- and mouse-derived samples. For
all samples, the rarefaction curves approached a plateau corresponding
to their respective number of unique variants, indicating that the
sampling depth was sufficient and that more sampling is unlikely to
yield additional sequence variants. Taken together, the observed
divergences in length distributions, nucleotide frequencies and
sequence identities strongly suggest that the different vRNA templates
of this IAV strain use different host mRNAs during cap-snatching
and/or transcription initiation of viral mRNAs.
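
A rarefaction curve of the kind referred to above can be approximated with a few lines of Python. The sketch below is a simplification and not the authors' procedure: it uses a single random ordering of the reads (published analyses commonly average over many permutations), and the function name, the number of steps and the seed are hypothetical parameters.

```python
import random

def rarefaction_curve(reads, steps=10, seed=0):
    """Number of unique sequence variants seen at increasing read depths.

    Reads are shuffled once, then accumulated in equal increments;
    each point is (number of reads sampled, number of unique variants).
    """
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    curve = []
    seen = set()
    previous_depth = 0
    for step in range(1, steps + 1):
        depth = len(shuffled) * step // steps
        seen.update(shuffled[previous_depth:depth])
        previous_depth = depth
        curve.append((depth, len(seen)))
    return curve
```

A curve whose last points add few or no new variants is what the text describes as approaching a plateau, which is the basis for judging the sampling depth to be sufficient.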

Figure 4 | Human sequences and genes used by the eight viral transcripts.
(a) Heatmap representation showing the distribution of the heterogeneous
sequences (vertical axis) associated with each IAV transcript. (b) Heatmap
representation showing the distribution of the host genes (vertical axis)
potentially used by each IAV transcript. The heterogeneous sequences/host
genes are hierarchically clustered. Data in each column have been
normalized to the maximum value in that column. To simplify the
representation, only sequences/genes that appear at least 1000 times were
used.

Mapping the heterogeneous sequences to known transcription
start sites (TSS) and identification of the targeted host genes. To
identify the origin of the heterogeneous sequences and to gain
information about sequences further downstream of the cleavage
site, we attempted to map the heterogeneous sequences to the
human genome. However, because the heterogeneous
sequences were short (10-15 nt), we were not able to successfully
use standard procedures for gene identification. As an alternative
approach, we decided to map the fragments to regions
surrounding TSS, since the fragments originate from the 5' ends of
capped host mRNAs. To this end, we used TSS previously reported
for A549 cells derived by Gene Identification Signature (GIS) paired-end
ditag (PET) sequencing, as part of the ENCODE Transcriptome
Project. Using these genomic locations, we mapped the
heterogeneous sequences to windows located from -100 to
+100 nt around the TSS using standard alignment procedures.
Additionally, for each fragment, we selected the match closest to
the TSS, and whenever more than one solution resulted from this
filtering step, we selected one randomly. Altogether, we mapped
56.2% of the heterogeneous sequences to 132,764 TSS associated
with 13,229 poly-A mRNAs expressed in A549 cells.
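
The window-based mapping strategy described above can be illustrated with a toy Python sketch. It is deliberately simplified and makes several assumptions that are not in the text: matching is exact and on a single strand of one sequence string, the function name and arguments are hypothetical, and ties are resolved by keeping the first hit rather than by random choice. The study itself used standard alignment procedures.

```python
def map_to_tss(fragment, genome_seq, tss_positions, flank=100):
    """Return (tss, offset) for the exact match closest to any TSS, or None.

    Only windows spanning -flank .. +flank around each TSS are searched;
    offset is the start of the match relative to the TSS.
    """
    best = None
    for tss in tss_positions:
        start = max(0, tss - flank)
        end = min(len(genome_seq), tss + flank)
        window = genome_seq[start:end]
        idx = window.find(fragment)
        while idx != -1:
            offset = start + idx - tss
            # Keep the hit whose start lies nearest to a TSS (first hit wins ties).
            if best is None or abs(offset) < abs(best[1]):
                best = (tss, offset)
            idx = window.find(fragment, idx + 1)
    return best
```

Restricting the search to TSS windows is what makes 10-15 nt fragments mappable at all: the search space is small enough that a short match is far less likely to be spurious than in a whole-genome search.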

Using this strategy, we found that the heterogeneous sequences 
derived from the A/HongKong/1/1968 (H3N2) mRNAs mapped to 
different sets of genes (Fig. 4b). We observed significant differences 
among the eight IAV mRNAs by pair-wise comparisons of the populations
of genes using tests (p-values smaller than 10"'^). This is in
agreement with the difference in sequence identity we observed, and 
also suggests that the different vRNA templates are associated with 
different host pre-mRNAs during cap-snatching. We then per- 
formed enrichment analysis of Gene Ontology (GO) terms to assess 
whether the different vRNA templates target different biological 
processes. Despite significant differences among the genes associated 
with the eight viral vRNAs, pair-wise comparisons of the lists of GO 
terms using tests indicated no significant differences (all p-values 
were greater than 0.05; Fig. 5). To account for bias due to the gene set 
we used, we also performed a GO terms enrichment analysis from a 
host mRNA expression profile obtained from mock-infected cells.
Comparison of GO terms corresponding to genes targeted by the 
virus and those associated with expression profiles from mock- 
infected cells indicated that cap-snatching by all A/HongKong/1/ 
1968 (H3N2) transcripts targets genes that are most abundant, and 
likely highly expressed (i.e. no significant difference was observed in 
the enrichment of GO terms between host genes used by A/ 
HongKong/1/1968 (H3N2) vs. host mRNA expression profiles 
obtained from mock-infected cells). Similar results were obtained 
from the IAV mRNAs isolated from mouse cells (Supplementary
Fig. 5b and 8). 

Identification of nucleotides enriched in host pre-mRNAs 
downstream of the heterogeneous sequences. To determine 
whether a specific RNA motif exists at the cleavage site, we used 
the results of the mapping of the heterogeneous sequences to 
known TSS to calculate nucleotide frequencies for host mRNA 
locations downstream of the 3' end of the heterogeneous 
sequences. The positions located immediately downstream of the 
heterogeneous sequences showed an enrichment in G and C 
nucleotides (Fig. 6). Because we previously observed a preference 
for a CA dinucleotide at the 3' end of the heterogeneous sequences
(immediately upstream), we then calculated the frequencies of the 
four-nucleotide motifs composed of the last two nucleotides of the 
heterogeneous sequences and the first two nucleotides located 
immediately downstream. Using the frequencies of all 
tetranucleotides present in the regions located -100 to +100 nt
around the TSS as negative controls, we calculated that the CA|GC
motif was the most enriched in our dataset (Supplementary Fig. 9). 
However, this enrichment was not uniform among all viral mRNAs. 
Similar results were observed in the sequences derived from the viral 
mRNAs isolated from mouse cells, where the C[A/G]|GC motif was 
the most enriched as compared to all tetranucleotides present in the 
regions located -100 to +100 nt around the TSS of the mouse
genome (Supplementary Fig. 10). These findings suggest that such
a motif might be preferentially used by A/HongKong/1/1968 (H3N2)
during cap-snatching and viral transcription in these two hosts. 
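
The tetranucleotide comparison described above (junction motifs versus all tetranucleotides in the TSS windows) can be sketched as a simple log-ratio. The code is illustrative only: the function names, the pseudocount and the use of a log2 ratio as the enrichment score are assumptions of this sketch, not the statistic reported in the study. Each junction tetranucleotide is assumed to be the last two nucleotides of a heterogeneous sequence followed by the first two downstream nucleotides of the mapped host sequence.

```python
import math
from collections import Counter

def tetramer_counts(sequence):
    """Count all overlapping tetranucleotides in a sequence."""
    return Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))

def junction_enrichment(junction_tetramers, background_windows, pseudocount=1.0):
    """log2(observed frequency / background frequency) for each junction motif."""
    observed = Counter(junction_tetramers)
    background = Counter()
    for window in background_windows:
        background.update(tetramer_counts(window))
    obs_total = sum(observed.values())
    # The pseudocount avoids division by zero for motifs absent from the background.
    bg_total = sum(background.values()) + pseudocount * 4 ** 4
    scores = {}
    for motif, n in observed.items():
        obs_freq = n / obs_total
        bg_freq = (background[motif] + pseudocount) / bg_total
        scores[motif] = math.log2(obs_freq / bg_freq)
    return scores
```

Under these assumptions, a motif such as CAGC (written CA|GC in the text) would stand out if its frequency across junctions clearly exceeded its frequency in the surrounding TSS windows.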

Figure 5 | Enrichment of Gene Ontology (GO) terms corresponding to the genes used by influenza A/HongKong/1/1968 (H3N2) virus cap-snatching
in human cells. Data are presented as level 3 GO categorization for biological process. The GO terms are hierarchically clustered. Data in each column
have been normalized to the maximum value in that column. To account for bias due to the gene set we used, the mRNA expression profile of mock-infected
cells (i.e. "MOCK") was used for the GO terms enrichment analysis.

Figure 6 | Analysis of nucleotide sequences enriched at the cleavage site
in targeted human pre-mRNAs. Logo representation of the last three
nucleotides of the heterogeneous sequence and the next three nucleotides
found in the host pre-mRNA downstream of the heterogeneous sequences.
The nucleotides are numbered according to the heterogeneous sequences/
pre-mRNA junction (blue line). The "ALL" dataset represents the sum of
all the individual datasets. All motifs are represented in the 5' to 3'
orientation.

Discussion

Using information derived from cloning and sequencing of a small
number of IAV mRNAs generated in vitro or extracted from infected
cells, several hypotheses have been proposed on the sequence specificity
of the IAV endonuclease and subsequent initiation of viral
mRNA transcription by host capped oligonucleotides. However, the
gene origin of the host leaders and information on the sequences
downstream from the cleavage sites remained unknown. Here, we
performed high-throughput sequencing of the 5' ends of total viral
mRNA isolated from both human (A549) and mouse (M-1) cells
infected with A/HongKong/1/1968 (H3N2) to investigate the characteristics
of the 5' terminus of viral mRNA early after infection.

Analysis of the heterogeneous sequences located at the 5' end of
the eight viral mRNAs indicated that most had lengths ranging from
10 to 15 nt, with main peaks at 11-12 nts. This result is consistent
with previous studies showing that IAV RdRp generally cleaves host
mRNA 10-15 nt downstream of the cap. Interestingly, the
host-derived sequences in both PB1 and PB2 mRNAs were shorter
by one nucleotide (peaking at 11 nt) when compared to the other
viral mRNAs. This was equally apparent in viral mRNA isolated
from A549 and M-1 cells, and could be related to the difference at
position 4 in the 3' end of PB1 and PB2 vRNA templates (C vs. U in
all other segments) in the IAV strain used. This difference could also
explain the observed disparity in the nucleotide frequency distributions
of the PB1 and PB2 mRNA host leaders as compared to the
other six transcripts. Unlike the strain used in this study, most IAV
strains have a C at position 4 in all three polymerase segments
(including PA); hence it would be interesting to investigate the role
of this nucleotide in host leader selection.

Although these sequences were heterogeneous, nucleotide frequency
distributions were not random. For all IAV mRNAs, the
host primers showed a content bias towards G/C nucleotides.
Although it is possible that high GC content might be used as a
determinant during cap-snatching, it is likely that this simply reflects
the high GC content frequently reported around TSS. In fact,
calculation of nucleotide frequencies observed in the windows located
from -100 to +100 nt around the TSS indicated a GC content of
63.5%, which is similar to the GC content calculated for the heterogeneous
sequences (i.e. 63.8 ± 9.4%). We observed an enrichment
in purines at the position just upstream of G2. While an A was the
preferred residue in most of the transcripts, PB2 mRNA exhibited a
preference for G. These results are consistent with several studies that
have shown a preference for A and G residues just upstream of
G2. A GC motif preceded this purine in a large number of
sequences, making the GCA trinucleotide the most abundant motif
found at the 3' end of the heterogeneous sequences. This result is
in agreement with previous reports, which found that primers with
CA at their 3' ends are preferred for transcription initiation in
infected cells and in vitro, and that CA-terminated capped fragments
can be used for viral mRNA synthesis in vitro. For PB2
mRNA, a GCAG tetranucleotide was found instead of a GCA trinucleotide.
The observed shift in the GCA trinucleotide, as compared to
the other IAV mRNAs, is in line with a prime-and-realign mechanism
during PB2 mRNA synthesis; presumably, an initial addition
of a G residue directed by the complementary C located at the 3'
end of the vRNA template is followed by a realignment of the nascent
chain, which results in the observed duplication of the G residue.
This might explain the enrichment in G observed at the position just
upstream of G2.

By mapping the heterogeneous sequences to TSS previously
reported for poly-A mRNAs expressed in A549 cells, we found that
CA|GC was enriched at a location corresponding to the last two
nucleotides of the heterogeneous sequences and the next two nucleotides
of the host pre-mRNA. Because the G residue just downstream
of the 3' end of the heterogeneous sequences might correspond to the
conserved G2, it is possible that a large number of cap-snatched
fragments have a 3' end corresponding to CAG and that the IAV
RdRp endonucleolytic cleavage occurs between G and the downstream
C. This is consistent with studies showing that IAV RdRp
usually cleaves after a purine residue, that IAV RdRp endonucleolytic
cleavage is highly efficient for substrates that carry GC
motifs, and that the purified N-terminal domain of PA selectively
cleaves after G residues of 5'-GC-3' motifs. However, because the
vRNA template is required to activate endonuclease cleavage and
transcription, our results cannot exclude the possibility that the
observed enrichment of this G residue is determined by the complementarity
of the host primer to the 3' end of the vRNA template. In
addition to this G, our analysis indicates that heterogeneous
sequences terminating with either CA or CAG are enriched. This
supports the idea of base pairing to the 3' end of the IAV genome to
allow elongation from the host primers during IAV mRNA transcription
initiation. Specifically, primers with nucleotides
complementary to the 3' end of the IAV vRNA template (i.e. CA, AG,
GG and AGC) were used by IAV RdRp to transcribe IAV mRNA in
vitro. This enrichment of sequence repeats that are complementary
to the 3' end of the IAV RNA templates also supports a
prime-and-realign mechanism for transcription of this IAV strain, as
previously observed in the leader sequence of viruses that use
cap-snatching.

Unexpectedly, our results indicated a divergence in the host
mRNAs/genes that are used by the eight viral transcripts, and this 
was observed in both human- and mouse-derived samples. This was 
revealed by noticeable differences in length distributions, nucleotide 
motifs, identity of the heterogeneous sequences, and mapping the 
reads to known TSS. This divergence can occur during cap-snatching 
and/or transcription initiation of viral mRNAs. Because they do not 
target the same host mRNAs, our results suggest negligible competi- 
tion amongst RdRp:vRNA complexes for individual host mRNA 
templates during cap-snatching. Although the source of this divergence
is unknown, it is likely associated with the vRNA template
itself, or with cellular RNPs binding selectively to each of the eight 
vRNAs. It is thus tempting to suggest a new layer of complexity 
within the A/HongKong/1/1968 (H3N2) cap-snatching mechanism, 
wherein cellular RNPs could be used to recruit the RdRp:vRNA
complexes to specific sets of genes/pre-mRNAs, or to selectively 
localize the RdRp:vRNA complexes. Such a model is supported by 
a recent observation showing that the different vRNAs localized in 
distinct sites in the nucleus. Because the length distributions, the
nucleotide motifs for each A/HongKong/1/1968 (H3N2) mRNA, 
and the shift observed for PB2 mRNA are very similar between the 
two cell lines tested, this also suggests conservation of such putative 
RdRp:vRNA complexes to efficiently cap-snatch the same motifs/ 
host pre-mRNAs in these two hosts. Notably, our analysis suggests
that despite divergence in the host mRNAs/genes used during
cap-snatching, the majority of the host primers originate from the
most abundant genes, which are likely highly expressed. It is possible 
that by targeting these genes, transcription of A/HongKong/1/1968 
(H3N2) mRNA modulates globally the expression of abundant host 
proteins, which might contribute to host shut-off. 

In conclusion, we performed high-throughput sequencing of
A/HongKong/1/1968 (H3N2) mRNA from infected cells early after
infection. Our analysis provided important information about the
primers used to initiate viral mRNA transcription, and the sequence
specificity of the viral endonuclease. In addition to substantiating
several previous findings, this study provided evidence for vRNA
template partitioning during cap-snatching by this IAV strain.
Although two cellular systems were used for infection by
A/HongKong/1/1968 (H3N2), further studies are needed to generalize
our findings to other IAV strains, or other RNA viruses using
cap-snatching to provide primers for viral transcription. More importantly,
our findings might have far-reaching consequences towards a
better understanding of the molecular mechanism governing the first
step of IAV transcription, and the global inhibition of the expression
of cellular genes.

Methods 

Cells and viruses. M-1 (mouse kidney epithelium, ATCC) and A549 (human lung 
epithelium, ATCC) cells were cultured in Dulbecco's modified Eagle's medium 
supplemented with 10% fetal bovine serum, 1 mM sodium pyruvate, 10% (v/v) 
non-essential amino acids (Gibco) and 2 mM L-glutamine. All cells were incubated at 
37°C in the presence of 5% CO2. The H3N2 human influenza isolate A/Hong Kong/1/1968 
was a clonal derivative that had been previously sequence characterized. 
Viruses were grown in 10-day-old specific-pathogen-free embryonated chicken eggs 
(Canadian Food Inspection Agency, Ottawa, Ontario). Viruses were titrated by 
plaque assay in MDCK cells as described previously. 

Library Preparation. M-1 and A549 cells were infected with IAV at an MOI of 2 pfu 
per cell. Cells were collected at 4 h post-infection, washed twice with 1X PBS, and 
polyadenylated RNA was extracted using the Dynabeads mRNA Direct Kit (Ambion) 
according to the manufacturer's instructions. Purified mRNA was subjected to 5' rapid 
amplification of cDNA ends (RACE) using the ExactSTART Eukaryotic mRNA 5'- & 
3'-RACE Kit (Epicentre Biotechnologies) according to the manufacturer's protocol. 
Briefly, ~0.3 μg of mRNA was dephosphorylated to convert uncapped RNA into 
unligatable 5'-hydroxyl RNA. The mRNA was then treated with pyrophosphatase to 
remove the 5' cap, and an RNA oligo was ligated to the monophosphate mRNA. The 
5'-oligo-tagged RNA was subjected to reverse transcription using MMLV reverse 
transcriptase (NEB) and amplified by 5 cycles of PCR using kit-provided primers. 
PCR products were cleaned using the Gel/PCR DNA Fragments Extraction Kit 
(Geneaid), and a fraction (~1/10th) of the product was used as a template for PCR 
using an IAV mRNA-specific primer located just upstream of the start codons of the 
genes. After about 25 cycles of PCR, products were gel-purified using the Gel/PCR 
DNA Fragments Extraction Kit (Geneaid), and a fraction (~1/25th) was used as 
template in another round of PCR (16-20 cycles) to add sequence identifier tags for 
multiplexing, and adapter sequences compatible with the Illumina sequencing 
platform. PCR products were gel-purified and verified by Sanger sequencing 
(StemCore Laboratories, Ottawa, Canada). Products were then mixed in 
approximately equal proportions and a single sample was sent for deep sequencing 
using the Illumina HiSeq 2000 System (McGill University and Genome Quebec 
Innovation Centre, Montreal, Canada). 

Analysis of the 5' UTR of IAV mRNA. For each sequence, the name of the sequence, 
the nucleotide composition and the sequencing quality score were stored in a 
database. Based on both their multiplexing tags and the transcript-specific sequences, 
the sequences were also divided into their respective host and viral mRNA groups. 
In-house Perl scripts were used for all data extraction and to obtain statistics on 
fragment length and nucleotide composition. As a first step, the sequences found 
between the primers used for PCR amplification were extracted. Then the heterogeneous 
sequences (i.e. upstream of G2) were extracted by searching for the conserved 
sequence corresponding to G2 to G12 of IAV mRNA, allowing up to two 
mismatches. To identify the origin of the heterogeneous sequences, in-house Perl 
scripts were used to build an artificial genome (mini-genome) composed of sequence 
windows spanning -100 to +100 nt around the TSSs of the hg19 (human) and mm10 
(mouse) genomes. For the TSSs of A549 cells, data from the Gene 
Identification Signature (GIS) paired-end ditag (PET) sequencing performed as part of the 
ENCODE Transcriptome Project were used (GEO accession number: GSM1006902). 
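
The extraction of the heterogeneous (host-derived) sequences described above can be illustrated with a minimal Python sketch. It is not the authors' Perl code. The motif `GCAAAAGCAGG` is an assumption, taken here as the commonly reported conserved sequence at positions 2 to 12 of IAV mRNA; it is not given in this section. The function names are hypothetical. The sketch scans a read for the window that best matches the motif and, if that window has at most two mismatches, returns everything upstream of it.

```python
# Assumed consensus for positions G2-G12 of IAV mRNA (not stated in the text).
CONSERVED_G2_G12 = "GCAAAAGCAGG"


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extract_host_primer(read, motif=CONSERVED_G2_G12, max_mismatches=2):
    """Return the bases upstream of the best motif match, or None.

    The best match is the window with the fewest mismatches (leftmost
    on ties). Reads whose best window exceeds max_mismatches are rejected.
    """
    k = len(motif)
    best_start, best_mm = None, max_mismatches + 1
    for start in range(len(read) - k + 1):
        mm = hamming(read[start:start + k], motif)
        if mm < best_mm:
            best_start, best_mm = start, mm
    if best_start is None:
        return None
    return read[:best_start]


# Toy reads: an 11-nt host-derived primer followed by the viral sequence.
exact = "ATCTCGACTGA" + "GCAAAAGCAGG" + "TAGAT"
two_mismatches = "ATCTCGACTGA" + "GCATAAGCTGG" + "TAGAT"
print(extract_host_primer(exact))
print(extract_host_primer(two_mismatches))
```

Choosing the best-scoring window rather than the first acceptable one is a design choice of this sketch: with two mismatches allowed, a short motif can match spuriously at a shifted position.
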
For mouse cells, transcription start positions in the mm10 genome were obtained 
from the UCSC Genome Browser (http://genome.ucsc.edu). Mapping of the 
heterogeneous sequences to these mini-genomes was performed with Bowtie v. 1.0.0. 
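
As an illustration of how such a mini-genome could be assembled before mapping, the following Python sketch cuts a window of 100 nt on either side of each TSS and writes the windows as FASTA records. It is a sketch under stated assumptions, not the authors' Perl scripts: the input layout (a dictionary of chromosome sequences and a list of `(gene_id, chrom, tss)` tuples with 0-based coordinates), the header format and all function names are hypothetical. Strand is ignored for simplicity.

```python
def tss_window(chrom_seq, tss, flank=100):
    """Return (start, end, sequence) for tss-flank .. tss+flank.

    Coordinates are 0-based, end is exclusive, and the window is
    clipped at the chromosome ends.
    """
    start = max(0, tss - flank)
    end = min(len(chrom_seq), tss + flank + 1)
    return start, end, chrom_seq[start:end]


def build_mini_genome(genome, tss_records, flank=100):
    """Build one (header, sequence) entry per TSS record.

    genome: dict mapping chromosome name to its sequence (assumed layout).
    tss_records: iterable of (gene_id, chrom, tss) tuples (assumed layout).
    """
    entries = []
    for gene_id, chrom, tss in tss_records:
        start, end, seq = tss_window(genome[chrom], tss, flank)
        entries.append((f"{gene_id}|{chrom}:{start}-{end}", seq))
    return entries


def to_fasta(entries):
    """Serialize (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in entries)


# Toy genome: one 251-nt chromosome and two hypothetical TSSs.
toy_genome = {"chr1": "A" * 50 + "C" + "G" * 200}
toy_tss = [("gene1", "chr1", 150), ("gene2", "chr1", 10)]
print(to_fasta(build_mini_genome(toy_genome, toy_tss)))
```

Keeping the gene identifier and window coordinates in each header lets the distance of a mapped read from the TSS be recovered later from the alignment position alone.
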
Using in-house Perl scripts, the alignment closest to the TSS (and its associated gene 
identification number, i.e. GeneID) was selected, and if more than one alignment 
remained after this filtering step, one was selected at random. For the GO term 
analysis, the frequency of the reads corresponding to each GeneID was scaled from 0 
to 100 for each viral transcript. The GeneIDs were then multiplied by the rescaled 
frequency to better reflect their enrichment, and the analysis of enriched GO terms 
corresponding to biological processes was performed in R with GoProfiles 
1.20.0 (available at http://bioconductor.org/biocLite.R). To account for bias due to the 
gene set we used, the mRNA expression profiles of uninfected A549 and M-1 cells, as 
determined previously by microarray, were used and GO term enrichment analysis 
was performed as above. All further data analysis was performed using in-house R 
scripts. Pearson's tests were performed in R using pair-wise comparisons between 
the populations of the heterogeneous sequences located at the 5' end of the eight IAV 
mRNAs, as well as the populations of genes to which the heterogeneous sequences 
mapped. Pearson's tests were also used in pair-wise comparisons of the lists of GO terms 
associated with each IAV mRNA. 
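
The hit-selection and rescaling steps described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' scripts. It assumes each read comes with a list of `(gene_id, distance_to_tss)` candidate alignments, and it interprets "scaled from 0 to 100" as min-max scaling within one viral transcript, which is only one possible reading. All names are hypothetical.

```python
import random
from collections import Counter


def pick_hit(hits, rng=random):
    """Select the gene whose alignment lies closest to the TSS.

    hits: list of (gene_id, distance_to_tss) for one read (assumed layout).
    Ties on absolute distance are broken at random.
    """
    if not hits:
        return None
    best = min(abs(dist) for _, dist in hits)
    candidates = [gene for gene, dist in hits if abs(dist) == best]
    return rng.choice(candidates)


def rescale_0_100(counts):
    """Min-max rescale read counts per GeneID to the range 0-100."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # All genes equally frequent: place them all at the top of the scale.
        return {gene: 100.0 for gene in counts}
    return {gene: 100.0 * (n - lo) / (hi - lo) for gene, n in counts.items()}


# Toy example for one viral transcript: three reads, each with candidates.
reads = [
    [("geneA", 5), ("geneB", -2), ("geneC", 40)],
    [("geneB", 0)],
    [("geneC", -7), ("geneA", 30)],
]
selected = [pick_hit(hits, rng=random.Random(0)) for hits in reads]
gene_counts = Counter(selected)
print(gene_counts)
print(rescale_0_100(gene_counts))
```

Passing a seeded `random.Random` instance makes the random tie-breaking reproducible, which is useful when the same reads are processed more than once.
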